
1. Introduction
Stock Market Analysis and Prediction is a project covering exploratory data analysis (EDA), data visualization, and predictive analysis of stock market data downloaded from Yahoo Finance through the yfinance library. I looked at one year of daily price data for Apple, Amazon, Google, and Microsoft. I used Python libraries to retrieve the stock information, visualize different aspects of it, and finally looked at a few ways of analyzing the risk of a stock based on its previous performance history. I also used a statistical technique, the Monte Carlo method, to estimate possible future stock prices.
We'll be answering the following questions along the way:
- What was the change in price of the stock over time?
- What was the daily return of the stock on average?
- What was the moving average of the various stocks?
- What was the correlation between different stocks' closing prices?
- What was the correlation between different stocks' daily returns?
- How much value do we put at risk by investing in a particular stock?
- How can we attempt to predict future stock behavior?
2. Loading data
2.1. Importing libraries
import numpy as np
import pandas as pd
import seaborn as sns
from IPython.display import display
import matplotlib.pyplot as plt
from datetime import datetime
import yfinance as yf
from scipy.stats import norm
sns.set_style('dark')
sns.set(rc={"figure.dpi":500, 'savefig.dpi':500})
plt.rcParams['figure.dpi'] = 500
plt.rcParams['savefig.dpi'] = 500
plt.rcParams['figure.figsize'] = [12, 3]
plt.rcParams['font.size'] = 10
2.2. Loading stock data
end_date = datetime.now()
# Go back one calendar year (DateOffset also copes with 29 February)
start_date = end_date - pd.DateOffset(years = 1)
stocks = ['AAPL', 'AMZN', 'GOOG', 'MSFT']
company_colours = {
    'AAPL': '#d7372f',
    'AMZN': '#ecaa01',
    'GOOG': '#4bb0ac',
    'MSFT': '#00a100',
}
# Store each DataFrame in a module-level variable named after its ticker
for stock in stocks:
    globals()[stock] = yf.download(stock, start = start_date, end = end_date)
display(AAPL.head(2))
display(AMZN.head(2))
display(GOOG.head(2))
display(MSFT.head(2))
| Date | Open | High | Low | Close | Adj Close | Volume |
|---|---|---|---|---|---|---|
| 2023-01-30 | 144.960007 | 145.550003 | 142.850006 | 143.000000 | 142.205154 | 64015300 |
| 2023-01-31 | 142.699997 | 144.339996 | 142.279999 | 144.289993 | 143.487961 | 65874500 |

| Date | Open | High | Low | Close | Adj Close | Volume |
|---|---|---|---|---|---|---|
| 2023-01-30 | 101.089996 | 101.739998 | 99.010002 | 100.550003 | 100.550003 | 70691900 |
| 2023-01-31 | 101.160004 | 103.349998 | 101.139999 | 103.129997 | 103.129997 | 66527300 |

| Date | Open | High | Low | Close | Adj Close | Volume |
|---|---|---|---|---|---|---|
| 2023-01-30 | 98.745003 | 99.408997 | 97.519997 | 97.949997 | 97.949997 | 24365100 |
| 2023-01-31 | 97.860001 | 99.910004 | 97.790001 | 99.870003 | 99.870003 | 22306800 |

| Date | Open | High | Low | Close | Adj Close | Volume |
|---|---|---|---|---|---|---|
| 2023-01-30 | 244.509995 | 245.600006 | 242.199997 | 242.710007 | 240.576828 | 25867400 |
| 2023-01-31 | 243.449997 | 247.949997 | 242.949997 | 247.809998 | 245.632004 | 26541100 |
3. Basic analysis of stock information
3.1. Closing price
# Draw all four adjusted closing price series on the same axes
for stock_symbol in stocks:
    df = globals()[stock_symbol]
    sns.lineplot(data = df['Adj Close'], color = company_colours[stock_symbol], label = stock_symbol, linewidth = 1.5)
plt.title('Historical adjusted closing prices')
plt.xlabel('Date')
plt.ylabel('Adjusted closing price');
3.2. Trading volume
fig, axes = plt.subplots(nrows = 2, ncols = 2, sharey = True, sharex = True)
fig.suptitle('Historical trading volume')
# One panel per stock, plus the average daily volume printed per ticker
for ax, stock_symbol in zip(axes.flat, stocks):
    df = globals()[stock_symbol]
    sns.lineplot(data = df['Volume'], color = company_colours[stock_symbol], label = stock_symbol, linewidth = 1.5, ax = ax)
    print(f"Average trading volume (trailing 12 months) for {stock_symbol}: {df['Volume'].mean():.0f} shares.")
fig.tight_layout()
Average trading volume (trailing 12 months) for AAPL: 58076802 shares.
Average trading volume (trailing 12 months) for AMZN: 56535420 shares.
Average trading volume (trailing 12 months) for GOOG: 24993570 shares.
Average trading volume (trailing 12 months) for MSFT: 26910615 shares.
In terms of trading volume, Apple and Amazon each traded roughly twice as many shares per day as Google and Microsoft.
3.3. Moving average
A moving average (MA) is a widely used indicator in technical analysis that helps smooth out price action by filtering out the "noise" from random price fluctuations. It is a trend-following, or lagging, indicator because it is based on past prices.
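As a small illustration of what the window length does, the sketch below applies a rolling mean to a made-up price series (the numbers are invented for the example and are not taken from the stock data). The first `window - 1` entries are NaN because a full window is not yet available, which is also why the longer moving averages in the plots start later than the shorter ones.

```python
import pandas as pd

# Hypothetical prices, invented for illustration only
prices = pd.Series([10.0, 11.0, 12.0, 13.0, 14.0, 15.0])

# 3-day simple moving average: mean of the current and the two previous prices
ma_3 = prices.rolling(window=3).mean()

print(ma_3.tolist())
```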
AAPL['MA 10 days'] = AAPL['Adj Close'].rolling(window = 10).mean()
AAPL['MA 20 days'] = AAPL['Adj Close'].rolling(window = 20).mean()
AAPL['MA 50 days'] = AAPL['Adj Close'].rolling(window = 50).mean()
sns.lineplot(data = AAPL['MA 10 days'], color = '#d7372f', label = '10-day MA', linewidth = 1.5)
sns.lineplot(data = AAPL['MA 20 days'], color = '#ecaa01', label = '20-day MA', linewidth = 1.5)
sns.lineplot(data = AAPL['MA 50 days'], color = '#4bb0ac', label = '50-day MA', linewidth = 1.5)
plt.title('Moving averages for APPLE stock')
plt.xlabel('Date')
plt.ylabel('Adjusted closing price');
fig, axes = plt.subplots(nrows=2, ncols=2, sharex=True)
fig.suptitle('Moving averages (trailing 12 months)')
# One panel per stock, each showing the 10-, 20- and 50-day moving averages
for ax, stock_symbol in zip(axes.flat, stocks):
    df = globals()[stock_symbol]
    df['MA 10 days'] = df['Adj Close'].rolling(window=10).mean()
    df['MA 20 days'] = df['Adj Close'].rolling(window=20).mean()
    df['MA 50 days'] = df['Adj Close'].rolling(window=50).mean()
    sns.lineplot(data=df['MA 10 days'], color='#d7372f', label='10-day MA', linewidth=1.5, ax=ax)
    sns.lineplot(data=df['MA 20 days'], color='#ecaa01', label='20-day MA', linewidth=1.5, ax=ax)
    sns.lineplot(data=df['MA 50 days'], color='#4bb0ac', label='50-day MA', linewidth=1.5, ax=ax)
    ax.set_xlabel('Date')
    ax.set_ylabel(f'{stock_symbol} close')
fig.tight_layout()
3.4. Daily returns
Now that we've done some baseline analysis, let's go ahead and dive a little deeper. We're now going to analyze the risk of the stock. In order to do so we'll need to take a closer look at the daily changes of the stock, and not just its absolute value. Let's go ahead and use pandas to retrieve the daily returns for the Apple stock.
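For reference, the daily return used here is the percentage change between consecutive closes, r_t = P_t / P_(t-1) - 1. The short sketch below checks that definition against pandas' `pct_change` on a few invented prices (the values are illustrative only).

```python
import pandas as pd

# Hypothetical closing prices, invented for illustration only
prices = pd.Series([100.0, 102.0, 99.96])

# Built-in percentage change between consecutive rows
returns = prices.pct_change()

# The same quantity written out from the definition
manual = prices / prices.shift(1) - 1

print(returns.round(4).tolist())
```

The first entry is NaN because the first day has no previous close to compare against.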
AAPL['Daily Return'] = AAPL['Adj Close'].pct_change()
sns.lineplot(
    data = AAPL['Daily Return'],
    color = company_colours['AAPL'])
plt.title('Daily returns of APPLE stock')
plt.xlabel('Date')
plt.ylabel('Return');
Now let's get an overall look at the average daily return using a histogram. We'll use seaborn to create both a histogram and kde plot on the same figure.
sns.histplot(
    data = AAPL['Daily Return'],
    alpha = 1,
    color = company_colours['AAPL'],
    kde = True)
plt.title('Daily returns of APPLE stock')
plt.xlabel('Return');
all_close = pd.concat([AAPL['Adj Close'], AMZN['Adj Close'], GOOG['Adj Close'], MSFT['Adj Close']], axis = 1)
all_close.columns = ['AAPL', 'AMZN', 'GOOG', 'MSFT']
all_close
| Date | AAPL | AMZN | GOOG | MSFT |
|---|---|---|---|---|
| 2023-01-30 | 142.205154 | 100.550003 | 97.949997 | 240.576828 |
| 2023-01-31 | 143.487961 | 103.129997 | 99.870003 | 245.632004 |
| 2023-02-01 | 144.621613 | 105.150002 | 101.430000 | 250.528580 |
| 2023-02-02 | 149.981689 | 112.910004 | 108.800003 | 262.274445 |
| 2023-02-03 | 153.641220 | 103.389999 | 105.220001 | 256.079346 |
| ... | ... | ... | ... | ... |
| 2024-01-22 | 193.889999 | 154.779999 | 147.710007 | 396.510010 |
| 2024-01-23 | 195.179993 | 156.020004 | 148.679993 | 398.899994 |
| 2024-01-24 | 194.500000 | 156.869995 | 150.350006 | 402.559998 |
| 2024-01-25 | 194.169998 | 157.750000 | 153.639999 | 404.869995 |
| 2024-01-26 | 192.419998 | 159.119995 | 153.789993 | 403.929993 |

250 rows × 4 columns
Now that we have all the adjusted closing prices in one DataFrame, we can compare the prices of two stocks to check how strongly they are correlated. The daily returns for all the stocks are computed in section 4.1.
# jointplot creates its own figure, so its size is set with height rather than plt.figure
sns.jointplot(
    data = all_close,
    x = 'AAPL',
    y = 'AMZN',
    alpha = 1,
    color = '#d7372f',
    height = 3);
# pairplot also creates its own figure, so no plt.figure call is needed
sns.pairplot(
    data = all_close,
    plot_kws = {'s': 8});
Finally, we can draw a correlation heatmap to get actual numerical values for the correlation between the stocks' adjusted closing prices.
corr = all_close.corr()
plt.figure(figsize = (3, 3))
sns.heatmap(corr, annot = True, cmap = 'crest');
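The plots in this section correlate price levels (`all_close`). Price levels of stocks that trend in the same direction can look highly correlated even when their day-to-day moves are unrelated, so it can be worth computing the correlation of daily returns as well. The sketch below illustrates the difference on synthetic data; the two random series are made up and independent by construction, and the drift and volatility values are arbitrary assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Two independent synthetic daily return series sharing the same positive drift
rets = pd.DataFrame(rng.normal(0.002, 0.01, size=(500, 2)), columns=['A', 'B'])

# Turn the returns into price paths starting at 100
prices = 100 * (1 + rets).cumprod()

# Correlation of price levels versus correlation of daily returns
price_corr = prices.corr().loc['A', 'B']
return_corr = prices.pct_change().corr().loc['A', 'B']

print(f'price correlation:  {price_corr:.2f}')
print(f'return correlation: {return_corr:.2f}')
```

With the notebook's data, the equivalent call for returns would be `all_close.pct_change().corr()`.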
Now that we've done some daily return and correlation analysis, let's go ahead and look deeper into actual risk analysis.
4. Risk analysis - Value at Risk
4.1. Historical approach
Let's go ahead and define a value at risk (VaR) parameter for our stocks. We can treat value at risk as the amount of money we could expect to lose (i.e., put at risk) at a given confidence level. There are several methods we can use for estimating value at risk. Let's go ahead and see some of them in action.
# Daily percentage change of every column (one column per stock)
all_returns = all_close.pct_change()
all_returns
| Date | AAPL | AMZN | GOOG | MSFT |
|---|---|---|---|---|
| 2023-01-30 | NaN | NaN | NaN | NaN |
| 2023-01-31 | 0.009021 | 0.025659 | 0.019602 | 0.021013 |
| 2023-02-01 | 0.007901 | 0.019587 | 0.015620 | 0.019935 |
| 2023-02-02 | 0.037063 | 0.073799 | 0.072661 | 0.046884 |
| 2023-02-03 | 0.024400 | -0.084315 | -0.032904 | -0.023621 |
| ... | ... | ... | ... | ... |
| 2024-01-22 | 0.012163 | -0.003605 | -0.001757 | -0.005418 |
| 2024-01-23 | 0.006653 | 0.008011 | 0.006567 | 0.006028 |
| 2024-01-24 | -0.003484 | 0.005448 | 0.011232 | 0.009175 |
| 2024-01-25 | -0.001697 | 0.005610 | 0.021882 | 0.005738 |
| 2024-01-26 | -0.009013 | 0.008685 | 0.000976 | -0.002322 |

250 rows × 4 columns
fig, axes = plt.subplots(nrows = 2, ncols = 2, sharex = True)
# One histogram of daily returns per stock, all on the same x-range
for ax, stock_symbol in zip(axes.flat, stocks):
    sns.histplot(data = all_returns[stock_symbol], alpha = 1, color = company_colours[stock_symbol], kde = True, ax = ax)
    ax.set_title(f'{stock_symbol} shares')
    ax.set_xlim(left = -0.1, right = 0.1)
    ax.set_ylabel('Count')
fig.tight_layout()
def VaR_historical(company):
    """
    Computes Value at Risk for a company's returns at 3 conventional confidence levels and produces a histogram.
    """
    alpha = [0.1, 0.05, 0.01]
    # Most recent adjusted closing price
    current_price = all_close[company].sort_index(ascending = False).iloc[0]
    var_historical = []
    print(f'Stock: {company}')
    print('VaR method: Historical')
    print(f'Current price: ${current_price:.2f}')
    print('Loss will not exceed:')
    for i in alpha:
        # Empirical i-quantile of the daily returns
        loss = all_returns[company].quantile(i)
        # Closest observed return below that quantile
        nearest_higher = all_returns.loc[all_returns[company] < loss, company].sort_values(ascending = False).iloc[0]
        print(f'  * ${-nearest_higher*current_price:.2f} per share ({(1-i)*100:.0f}% confidence)')
        var_historical.append({
            'Confidence': f'{(1-i)*100:.0f}%',
            'Loss%': loss,
            'Nearest higher': nearest_higher
        })
    # Mark the 95% VaR (second entry in the list) on the histogram of returns
    perc_95 = var_historical[1]['Loss%']
    sns.histplot(data = all_returns[company], color = company_colours[company], alpha = 0.9, kde = True)
    plt.axvline(x = perc_95, color = '#564c4d', linewidth = 1)
    plt.annotate(f'VaR (95% confidence):\n {perc_95*100:.2f}%', xy = (perc_95, 28), weight = 'bold', color = '#232023',
                 xytext = (perc_95-0.035, 25), fontsize = 10, arrowprops = dict(arrowstyle = "->", color = '#808080'))
    plt.xlim(left = -0.1, right = 0.1)
    plt.xlabel(f'{company} Returns')
    plt.title(f'{company} Historical Value at Risk')
VaR_historical('AAPL')
Stock: AAPL
VaR method: Historical
Current price: $192.42
Loss will not exceed:
  * $2.49 per share (90% confidence)
  * $3.32 per share (95% confidence)
  * $6.89 per share (99% confidence)
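Stripped of the printing and plotting, the historical method comes down to an empirical quantile of past returns. A minimal sketch, assuming a small set of invented returns and a hypothetical helper named `historical_var` (neither is part of the notebook):

```python
import numpy as np

def historical_var(returns, price, alpha=0.05):
    """Loss per share not exceeded with confidence 1 - alpha (historical method)."""
    # alpha-quantile of the observed returns (linear interpolation between points)
    cutoff = np.quantile(returns, alpha)
    # A negative return is a loss, so flip the sign and scale by the share price
    return -cutoff * price

# Hypothetical daily returns, invented for illustration only
returns = np.array([-0.05, -0.03, -0.02, -0.01, 0.0, 0.0, 0.01, 0.01, 0.02, 0.03, 0.04])

var_90 = historical_var(returns, price=100.0, alpha=0.10)
print(f'90% VaR: ${var_90:.2f} per share')
```

Unlike `VaR_historical` above, this sketch uses the interpolated quantile itself rather than the nearest observed return below it, so the two can differ slightly on real data.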
4.2. Parametric approach
The parametric approach, also called the variance-covariance approach, assumes that returns are normally distributed and estimates VaR directly from the standard deviation of portfolio returns. It assumes that risk factor returns are normal (so risk factor levels are lognormal) and that portfolio returns are linear functions of the risk factors, and hence normal as well. The last assumption makes it easy to derive the formula for the portfolio distribution from the security distribution, without having to generate any distributions explicitly.
The normal distribution is not necessarily the best way to describe returns. However, it is a very good way to picture the concepts, and it is a good starting point for us to elaborate on more complex and realistic scenarios.
The main drawback of the parametric approach is that real-world returns usually have a distribution with "fat tails" (high kurtosis).
The parametric method is accurate for linear assets, but less accurate for options and other non-linear derivatives. It also becomes less accurate at longer horizons. Parametric calculations are faster than either simulation method and don't need extensive historical data (only the correlation and volatility matrices are needed). The method is not recommended for long horizons, for portfolios with many options, or for assets with skewed distributions.
https://medium.com/analytics-vidhya/value-at-risk-with-python-4e3409e1c23d
def VaR_parametric(company):
    """
    Computes parametric Value at Risk for a company's returns at 3 conventional confidence levels.
    """
    alpha = [0.1, 0.05, 0.01]
    # Most recent adjusted closing price
    current_price = all_close[company].sort_index(ascending = False).iloc[0]
    var_parametric = []
    # The mean daily return is treated as zero; only the standard deviation enters the estimate
    stdev = np.std(all_returns[company], axis = 0)
    print(f'Stock: {company}')
    print('VaR method: Parametric')
    print(f'Current price: ${current_price:.2f}')
    print('Loss will not exceed:')
    for i in alpha:
        # Standard normal quantile for the chosen confidence level
        z_score = norm.ppf(1 - i)
        var = -z_score * stdev
        print(f'  * ${-var * current_price:.2f} per share ({(1-i)*100:.0f}% confidence)')
        var_parametric.append({
            'Confidence': f'{(1 - i) * 100:.0f}%',
            'Loss%': var
        })
VaR_parametric('AMZN')
Stock: AMZN
VaR method: Parametric
Current price: $159.12
Loss will not exceed:
  * $4.08 per share (90% confidence)
  * $5.23 per share (95% confidence)
  * $7.40 per share (99% confidence)
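For comparison, a parametric VaR that keeps the mean term can be sketched as VaR = -(mean + z * stdev) * price, where z is the standard normal quantile at level alpha. The sketch below uses `statistics.NormalDist` from the standard library in place of `scipy.stats.norm` so that it runs without SciPy; the helper name `parametric_var` and the input values are hypothetical.

```python
from statistics import NormalDist

def parametric_var(mean, stdev, price, alpha=0.05):
    """Loss per share not exceeded with confidence 1 - alpha, assuming normal returns."""
    # Standard normal quantile at level alpha (negative for alpha < 0.5)
    z = NormalDist().inv_cdf(alpha)
    # Return at the alpha-quantile is mean + z * stdev; flip the sign to express it as a loss
    return -(mean + z * stdev) * price

# Hypothetical inputs: zero mean, 2% daily standard deviation, $100 share price
var_95 = parametric_var(mean=0.0, stdev=0.02, price=100.0, alpha=0.05)
print(f'95% VaR: ${var_95:.2f} per share')
```

With a zero mean this reduces to the calculation in `VaR_parametric`; a positive mean daily return lowers the estimate slightly.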
4.3. Monte Carlo simulation approach
Monte Carlo simulation is similar to historical simulation in that it estimates VaR by simulating risk factor scenarios and revaluing all positions in a portfolio for each trial (i.e., full re-pricing). However, instead of generating risk factor scenarios from the historical distribution, it generates them from a lognormal distribution. Thus, the distribution assumption for risk factors is the same as in the parametric model, but the method for generating the price distribution of the security is different. The method accounts for all non-linearities in portfolio positions.
As in the parametric method, volatilities and correlations for the risk factors are calculated directly from time series data over user-specified start and end dates. Users may also specify an optional decay factor, as well as the number of simulations to perform per analysis. This method accurately prices all types of complex non-linear positions as well as simple linear instruments. It also provides a full distribution of potential portfolio gains and losses (which need not be symmetrical). However, it does not take into account non-normality in the underlying factors (e.g. fat tails, mean reversion) unless the scenarios are generated under appropriate conditions.
The simulation function defined below works as follows:
- It takes a company ticker and a number of days, and starts from the first adjusted closing price in the data window.
- It uses a loop to simulate the stock price for each day in the specified range (1 to days).
For each day:
- A random shock is generated using a normal distribution. This represents the unpredictable component of stock price movements.
- A drift is calculated based on the mean of daily returns. The drift represents the average expected movement in the stock price.
- The new stock price for the day is calculated using the previous day's price, the drift, and the shock.
The function returns a DataFrame that contains the simulated stock price for each day.
In other words, the function models future stock prices by combining a predictable component (drift) and an unpredictable component (shock) for each day. This process is repeated for the specified number of days, and the result represents one possible trajectory of the stock price based on the provided parameters.
In the price update, the first term is known as "drift", which is the average daily return multiplied by the change in time. The second term is known as "shock": in each time period the stock will "drift" and then experience a "shock" that randomly pushes the stock price up or down. By simulating this series of drift and shock steps thousands of times, we can begin to build a picture of where we might expect the stock price to be.
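The drift-plus-shock update described above can be written as S_t = S_(t-1) * (1 + mu * dt + sigma * eps * sqrt(dt)), with eps drawn from a standard normal distribution. The following is a minimal vectorised sketch of one such path, not the notebook's implementation: `simulate_path` and its parameter values are hypothetical, and it assumes that mu and sigma are daily figures and that each step is one trading day (dt = 1).

```python
import numpy as np

def simulate_path(start_price, mu, sigma, days, rng):
    """One simulated price path with days + 1 points (drift plus random shock)."""
    dt = 1.0  # assumption: one trading day per step, since mu and sigma are daily
    # Unpredictable component: zero-mean normal shocks scaled by volatility
    shocks = rng.normal(0.0, sigma * np.sqrt(dt), size=days)
    # Predictable component: average return per step
    drift = mu * dt
    # Compound the per-step returns: S_t = S_(t-1) * (1 + drift + shock_t)
    growth = np.cumprod(1.0 + drift + shocks)
    return np.concatenate(([start_price], start_price * growth))

rng = np.random.default_rng(42)
path = simulate_path(start_price=100.0, mu=0.0005, sigma=0.02, days=252, rng=rng)
print(f'final simulated price: {path[-1]:.2f}')
```

Computing all shocks at once and compounding them with `cumprod` avoids the row-by-row DataFrame updates, which matters once the simulation is repeated thousands of times.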
def price_montecarlo(company, days):
    '''
    Purpose: simulate the price of a stock over the course of given days
    Input: company (ticker symbol), days (number of simulation steps)
    Output: DataFrame with one row per step
    '''
    # Start from the first (oldest) adjusted closing price in the data window
    initial_price = all_close[company].sort_index(ascending = True).iloc[0]
    results = pd.DataFrame(columns=['Time', 'Simulated price', 'Drift', 'Shock'])
    results.loc[0] = [0, initial_price, 0, 0]
    mean = np.mean(all_returns[company])
    stdev = np.std(all_returns[company], axis = 0)
    # mean and stdev are daily figures, so with delta_t = 1 / days all steps together span about one trading day
    delta_t = 1 / days
    for i in range(1, days):
        results.loc[i, 'Time'] = i
        results.loc[i, 'Drift'] = mean * delta_t
        # The shock is centred on mean * delta_t, so the drift enters the update twice
        results.loc[i, 'Shock'] = np.random.normal(mean * delta_t, stdev * np.sqrt(delta_t))
        # New price = previous price grown by drift plus shock
        previous_price = results['Simulated price'].iloc[i - 1]
        results.loc[i, 'Simulated price'] = previous_price * (1 + results['Drift'].iloc[i] + results['Shock'].iloc[i])
    return results
price_montecarlo('AMZN', 252)
| | Time | Simulated price | Drift | Shock |
|---|---|---|---|---|
| 0 | 0.0 | 100.550003 | 0.000000 | 0.000000 |
| 1 | 1.0 | 100.446265 | 0.000008 | -0.001040 |
| 2 | 2.0 | 100.433661 | 0.000008 | -0.000134 |
| 3 | 3.0 | 100.279795 | 0.000008 | -0.001540 |
| 4 | 4.0 | 100.135457 | 0.000008 | -0.001447 |
| ... | ... | ... | ... | ... |
| 247 | 247.0 | 99.378296 | 0.000008 | 0.000173 |
| 248 | 248.0 | 99.524103 | 0.000008 | 0.001459 |
| 249 | 249.0 | 99.546069 | 0.000008 | 0.000213 |
| 250 | 250.0 | 99.439005 | 0.000008 | -0.001084 |
| 251 | 251.0 | 99.707175 | 0.000008 | 0.002689 |

252 rows × 4 columns
- Next, we run that simulation repeatedly (20 times in the VaR call at the end of this section).
- Each repetition simulates the future stock prices using the Monte Carlo simulation function.
- The final stock price from each simulation is stored in a list of results.
- After running all simulations, we calculate the 10%, 5% and 1% quantiles of the final prices. The 1% quantile, for example, is the value below which 1% of the simulated stock prices fall, so it gives an idea of the potential downside risk in the simulated scenarios at a 99% confidence level.
# A visualisation of the price simulations using Monte Carlo over a low number of repetitions
company = 'AAPL'
days = 252
rep = 100
for r in range(rep):
    sns.lineplot(
        data = price_montecarlo(company, days),
        y = 'Simulated price',
        x = 'Time'
    )
plt.xlabel("Days")
plt.ylabel("Price")
plt.title(f'Monte Carlo simulation for {company} stock price, {rep} repetitions');
Define a function that would return the predicted price for a much larger simulation sample size, then compute the Value at Risk.
def VaR_montecarlo(company, days, rep):
    sim_results = []
    # Most recent adjusted closing price; losses are measured against it
    current_price = all_close[company].sort_index(ascending = False).iloc[0]
    # price_montecarlo starts each path at the oldest price in the window, not at current_price
    for i in range(1, rep+1):
        prediction = price_montecarlo(company, days).iloc[-1]['Simulated price']
        sim_results.append({
            'Simulation': i,
            'Prediction': prediction,
            'Loss': current_price - prediction
        })
    sim_results_df = pd.DataFrame(sim_results)
    alpha = [0.1, 0.05, 0.01]
    print(f'Stock: {company}')
    print('VaR method: Monte Carlo')
    print(f'Current price: ${current_price:.2f}')
    print('Loss will not exceed:')
    for i in alpha:
        # Simulated final price at the i-quantile, expressed as a loss from the current price
        cutoff = sim_results_df['Prediction'].quantile(i)
        var = current_price - cutoff
        print(f'  * ${var:.2f} per share ({(1-i)*100:.0f}% confidence)')
    sns.histplot(
        data = sim_results_df,
        x = 'Loss',
        color = company_colours[company],
        alpha = 1
    )
VaR_montecarlo('AAPL', 252, 20)
Stock: AAPL
VaR method: Monte Carlo
Current price: $192.42
Loss will not exceed:
  * $51.69 per share (90% confidence)
  * $51.97 per share (95% confidence)
  * $52.23 per share (99% confidence)
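The three loss figures above sit close to the gap between the first price in the data window ($142.21) and the current price ($192.42), because each simulated path starts at the first price while the loss is measured against the current one. The sketch below shows one way a Monte Carlo VaR could start every path at the current price instead. It is an illustration under stated assumptions, not the notebook's method: `monte_carlo_var` is a hypothetical helper, mu and sigma are taken to be daily figures with one step per trading day, and the inputs are invented.

```python
import numpy as np

def monte_carlo_var(price, mu, sigma, days, reps, alphas=(0.10, 0.05, 0.01), seed=0):
    """Simulated loss per share at each alpha, with all paths starting at `price`."""
    rng = np.random.default_rng(seed)
    # reps x days matrix of simulated daily returns (drift plus random shock)
    daily = mu + sigma * rng.standard_normal((reps, days))
    # Final price of each path after compounding its daily returns
    final_prices = price * np.prod(1.0 + daily, axis=1)
    # VaR at confidence 1 - alpha: current price minus the alpha-quantile of final prices
    return {a: price - np.quantile(final_prices, a) for a in alphas}

# Hypothetical inputs: $100 share, zero drift, 2% daily volatility, 1-day horizon
var = monte_carlo_var(price=100.0, mu=0.0, sigma=0.02, days=1, reps=100_000)
for a, loss in var.items():
    print(f'{1 - a:.0%} confidence: ${loss:.2f} per share')
```

With a one-day horizon and normal shocks, the simulated figures should land close to the parametric estimates for the same volatility, which is a convenient sanity check. A much larger number of repetitions than 20 is needed before the tail quantiles become stable.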
5. Recommendations for future work
- Consider Weighted Moving Average Methods for VaR computation
Unlike the VaR methods used in this project, which assume uniform weights for all historical returns, weighted approaches such as the Exponentially Weighted Moving Average (EWMA) method assign non-uniform weights, with a preference for more recent returns. The most recent returns have higher weights because they influence "today's" return more heavily than returns further in the past. This makes the method better at capturing the varying influence of past returns on the current risk assessment.
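One common form of EWMA updates the variance estimate recursively: var_t = lam * var_(t-1) + (1 - lam) * r_(t-1)^2. The sketch below is a minimal illustration of that recursion; the helper name `ewma_volatility`, the decay factor of 0.94 (a value often quoted for daily data) and the way the recursion is seeded are all assumptions, and the returns are invented.

```python
import numpy as np

def ewma_volatility(returns, lam=0.94):
    """EWMA volatility estimate for each day, based on returns up to the day before."""
    returns = np.asarray(returns, dtype=float)
    var = np.empty_like(returns)
    # Seed with the first squared return (an arbitrary choice for this sketch)
    var[0] = returns[0] ** 2
    for t in range(1, len(returns)):
        # Recent squared returns get weight (1 - lam); older information decays by lam
        var[t] = lam * var[t - 1] + (1 - lam) * returns[t - 1] ** 2
    return np.sqrt(var)

# Hypothetical daily returns, invented for illustration only
returns = np.array([0.01, -0.02, 0.015, -0.03, 0.005])
vol = ewma_volatility(returns)
print(np.round(vol, 5))
```

The resulting volatility could then replace the plain standard deviation in a parametric VaR, so that a recent burst of large moves raises the risk estimate faster than an equally weighted window would.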
- Include VaR backtesting
Various forms of backtesting could be used to assess and enhance the reliability of VaR models. One suggested approach is Kupiec's proportion-of-failures (POF) test, a simple and straightforward statistical method that evaluates the accuracy of VaR predictions.
Additionally, consider including a visual representation of simulated VaR over an extended time horizon, with all VaR approaches on the same graph. This would allow for a more comprehensive comparison of models.
Example: A combined plot depicting the returns and VaR estimates, all displayed at the 95% confidence level.
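To make the backtesting suggestion concrete, Kupiec's POF test compares the observed number of VaR breaches with the number expected at the chosen confidence level, using a likelihood-ratio statistic that is approximately chi-square distributed with one degree of freedom. The sketch below is a minimal illustration; the function name `kupiec_pof` and the example counts are hypothetical.

```python
import math

def kupiec_pof(num_obs, num_failures, alpha):
    """Kupiec proportion-of-failures likelihood-ratio statistic.

    alpha is the expected failure rate (e.g. 0.05 for a 95% VaR).
    """
    n, x = num_obs, num_failures

    def log_lik(p):
        # Binomial log-likelihood without the constant term; 0 * log(0) is treated as 0
        ll = 0.0
        if x > 0:
            ll += x * math.log(p)
        if n - x > 0:
            ll += (n - x) * math.log(1 - p)
        return ll

    # Likelihood under the model's failure rate versus the observed failure rate
    return -2.0 * (log_lik(alpha) - log_lik(x / n))

CHI2_1DF_95 = 3.841  # 95% critical value of chi-square with 1 degree of freedom

# Hypothetical backtest: 250 trading days, 20 days where the loss exceeded the 95% VaR
lr = kupiec_pof(num_obs=250, num_failures=20, alpha=0.05)
print(f'LR = {lr:.2f}, reject model: {lr > CHI2_1DF_95}')
```

A statistic above the critical value suggests that the observed breach frequency is inconsistent with the stated confidence level, either because the model underestimates risk (too many breaches) or overestimates it (too few).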
